A Simple Model for Forecasting Housing Prices

We want to start simply, so we'll just look at a basic forecast of the national Case-Schiller index based on population, and then add a few other variables. Schiller provides the data for this on the webpage for his book Irrational Exuberance.

We'll use pymc, a well-developed and maintained python library for bayesian statistical analysis, especially Markov Chain Monte Carlo (MCMC).

Other libraries used here include pandas and numpy. matplotlib is used for plotting, and I use a custom configuration file based on the one used in Bayesian Programming for Hackers, a great book with tons of good information on bayesian methods and graphing.

So first we need to import modules:



In [28]:

    
#Modules
import pymc as pm
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
%matplotlib inline

Then let's load our data. I've made a few quick changes to convert the original file named "Fig2-1.xls" so that it is easy to import:



In [208]:

    
#Data file
datafile = '/home/potterzot/data/irrational_exuberance/prices_population.csv'
data = pd.read_csv(datafile)

data.columns









    Out[208]:





Index(['date', 'real_home_price_idx', 'date2', 'real_build_cost_idx', 'us_pop', 'long_rate', 'long_rate_src', 'date3', 'hpi', 'hpi_src', 'date4', 'nom_build_cost_idx', 'nom_build_cost_src', 'date5', 'cpi', 'cpi_src', 'date6', 'cpi_annual'], dtype='object')

Before we do anything, let's look at a few samples of what our data look like:



In [234]:

    
data.loc[0:10,:]









    Out[234]:






  
    
      
      date
      real_home_price_idx
      date2
      real_build_cost_idx
      us_pop
      long_rate
      long_rate_src
      date3
      hpi
      hpi_src
      date4
      nom_build_cost_idx
      nom_build_cost_src
      date5
      cpi
      cpi_src
      date6
      cpi_annual
    
  
  
    
      0 
       1890
       100.000000
       1890
       51.362340
       63.056
       3.42
       Homer
       1890
       3.657175
       Grebler
       1890
       NaN
       Grebler
       1890
       7.611652
       Warren&Pearson
       1890
       7.611652
    
    
      1 
       1891
        88.011791
       1891
       47.266163
       64.361
       3.62
       Homer
       1891
       3.299213
       Grebler
       1891
       NaN
       Grebler
       1891
       7.801942
       Warren&Pearson
       1891
       7.801942
    
    
      2 
       1892
        95.421736
       1892
       52.048155
       65.666
       3.60
       Homer
       1892
       3.358873
       Grebler
       1892
       NaN
       Grebler
       1892
       7.326213
       Warren&Pearson
       1892
       7.326213
    
    
      3 
       1893
        92.297385
       1893
       44.673332
       66.970
       3.75
       Homer
       1893
       3.502058
       Grebler
       1893
       NaN
       Grebler
       1893
       7.897091
       Warren&Pearson
       1893
       7.897091
    
    
      4 
       1894
       123.980483
       1894
       57.263434
       68.275
       3.70
       Homer
       1894
       4.080763
       Grebler
       1894
       NaN
       Grebler
       1894
       6.850483
       Warren&Pearson
       1894
       6.850483
    
    
      5 
       1895
       117.455092
       1895
       61.470338
       69.580
       3.46
       Homer
       1895
       3.704903
       Grebler
       1895
       NaN
       Grebler
       1895
       6.565052
       Warren&Pearson
       1895
       6.565052
    
    
      6 
       1896
       100.302990
       1896
       60.068945
       70.885
       3.60
       Homer
       1896
       3.209723
       Grebler
       1896
       NaN
       Grebler
       1896
       6.660193
       Warren&Pearson
       1896
       6.660193
    
    
      7 
       1897
       106.515703
       1897
       62.384893
       72.189
       3.40
       Homer
       1897
       3.311145
       Grebler
       1897
       NaN
       Grebler
       1897
       6.469903
       Warren&Pearson
       1897
       6.469903
    
    
      8 
       1898
       110.184140
       1898
       61.438038
       73.494
       3.35
       Homer
       1898
       3.525922
       Grebler
       1898
       NaN
       Grebler
       1898
       6.660193
       Warren&Pearson
       1898
       6.660193
    
    
      9 
       1899
       103.853113
       1899
       64.044606
       74.799
       3.10
       Homer
       1899
       3.370805
       Grebler
       1899
       NaN
       Grebler
       1899
       6.755342
       Warren&Pearson
       1899
       6.755342
    
    
      10
       1900
       101.574295
       1900
       49.420634
       76.094
       3.15
       Homer
       1900
       3.854054
       Grebler
       1900
       NaN
       Grebler
       1900
       7.897091
       Warren&Pearson
       1900
       7.897091



In [265]:

    
#Plot real house prices and population versus date
z = np.polyfit(data['date'],data['real_home_price_idx'],1)
p = np.poly1d(z)
figsize(12, 4)
plt.title("National Real Home Price Index and Population")
plt.plot(data.loc[data['date']>=1960, 'date'], data.loc[data['date']>=1960,'real_home_price_idx'], label="CS Index")
plt.plot(data.loc[data['date']>=1960, 'date'], p(data.loc[data['date']>=1960, 'date']), color='grey', linewidth=1)
plt.plot(data.loc[data['date2']>=1960, 'date2'], data.loc[data['date2']>=1960, 'us_pop'], label="Population")
plt.ylim(0,350)
legend = plt.legend()
legend.get_frame().set_alpha(0.4)
plt.tight_layout()
plt.show()

Notice that, although the scale here is scrunched together pretty badly, it seems like the population and the trendline for housing prices are fairly similar. Let's look at that more closely.



In [266]:

    
plt.clf()
figsize(12, 4)
plt.plot(data['date'], p(data['date']), label="Housing Price Trendline")
plt.plot(data['date2'], data.loc[:,'us_pop']/10+68, label="Population") #Here we scale and center the population
plt.title("Housing Price Trendline and US Population, Scaled and Centered")
plt.ylim(0,150)
legend = plt.legend()
legend.get_frame().set_alpha(0.4)
plt.tight_layout()
plt.show()

We can see that housing prices have increased somewhat more quickly than population, so population may not be a great predictor, but it's not a terrible place to start with a simple model. It makes sense that as population increases, the price of housing might also be expected to increases, especially in spatially-constrained geographies like dense urban-areas.

We'll refine the model substantially later, but starting with a simple model will allow us to get the basic framework of the code down. It can always be made more complicated, but starting with something complicated can make development quite a bit slower.

Setting up a Forecasting Model

We'll use a bayesian approach to the forecast, in an attempt to estimate a percent change in price in each year with a sense of how confident we are in the estimate.

Our prior can just be the average annual percent growth from 1890 to 2014.



In [258]:

    
#Set up an annual data data file
adata = pd.DataFrame()
adata['date'] = data.loc[data['date']<=1950, 'date']
adata['rhpi'] = data.loc[data['date']<=1950, 'real_home_price_idx']
adata['pop'] = data.loc[data['date']<=1950, 'us_pop']

#Quarterly data
#qdata = pd.DataFrame()
#qdata['date'] = data.loc[data['date']>=1960, 'date']
#qdata['rhpi'] = data.loc[data['date']>=1960, 'real_home_price_idx']
#qdata['year'] = np.floor(qdata['date'])
#qdata['pop'] = data.loc[data['date2']>=qdata['year'], 'us_pop']
#qdata

#Average annual change
last_year = adata.loc[adata.index[-1],'date']
first_year = adata.loc[0, 'date']
n_years = last_year-first_year
rhpi = adata['rhpi']
delta_date = adata['date'].diff()
delta_rhpi = rhpi.diff()[1:]
avg_annual_chg = delta_rhpi.sum()/(n_years)
pct_chg = delta_rhpi / np.array(rhpi[0:-1])

#Compound rate of change

avg_rate_chg = (rhpi[rhpi.index[-1]]/rhpi[0])**(1/(n_years)) - 1
{'actual': rhpi[rhpi.index[-1]], 'estimate': rhpi[0]*(1+avg_rate_chg)**(n_years)} #check that our estimate is correct









    Out[258]:





{'estimate': 105.89483934250032, 'actual': 105.89483934250001}



In [319]:

    
#Plot the year to year % changes
plt.clf()
figsize(12, 4)
plt.title("Percent change from period to period")
plt.bar(adata.loc[1:,'date'], pct_chg*100)
locs,labels = plt.yticks()
plt.yticks(locs, list(map((lambda x: "{}%".format(int(x))), locs)))
plt.show()

But what's the distribution of these percent changes? If we want to capture the way it tends to move, let's look at a histogram of the percent change:



In [407]:

    
#mean and standard deviation
m = pct_chg.mean()
sd = 10/(np.std(pct_chg))

# Create a normal distribution
p = pm.Normal('p', m, sd)
mdl_pct_chg = np.array([p.random() for i in np.arange(0,10000)])

#
plt.clf()
figsize(12, 4)
hist1, bins1 = np.histogram(pct_chg, bins=15)
hist2, bins2 = np.histogram(mdl_pct_chg, bins=15)
width = 0.7 * (bins[1] - bins[0])
center1 = (bins[:-1] + bins[1:]) / 2
center2 = (bins2[:-1] + bins2[1:]) / 2
plt.bar(center1, hist1, align='center', width=width)
plt.plot(center2, hist2*len(pct_chg)/10000)
plt.title("Histogram of Percent Change")
plt.show()



In [398]:

    
#Now create a MCMC model...
#@pm.deterministic
#def m_():
#def sd_():
m_ = pm.Normal('m', 0, 1)
sd_ = pm.Normal('sd', 0, 1)

observations = pm.Normal("obs", m, sd, value=pct_chg, observed=True)
model = pm.Model([observations, m_, sd_])



In [399]:

    
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)









    



 [-----------------100%-----------------] 40000 of 40000 complete in 3.7 sec



In [437]:

    
#Get the traces, predicted points of the shape of m and sd
m_samples = mcmc.trace('m')[:]
sd_samples = mcmc.trace('sd')[:]

# histogram of the samples:
plt.clf()
figsize(12.5, 8)

ax = plt.subplot(211)
ax.set_autoscaley_on(False)

plt.hist(m_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\mu$", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
    $\mu,\;\tau$""")
plt.xlim([-3, 3])
plt.xlabel("$\mu$ value")

ax = plt.subplot(212)
ax.set_autoscaley_on(False)
plt.hist(sd_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\tau$", color="#7A68A6", normed=True)
plt.legend(loc="upper left")
plt.xlim([-3, 3])
plt.xlabel("$\tau$ value")

plt.show()

That probably doesn't seem like much, except that now we have an estimate of the two variables that we think are affecting the underlying percent change in housing price, assuming the change is just a random variable from a normal distribution.

But that's not really very interesting. That just says that prices change randomly as defined by our normal distribution. What we hope is that we can determine a series of factors that affect the change. Let's begin by assuming that the change in price is dependent on the change in population in the previous year.

	date	real_home_price_idx	date2	real_build_cost_idx	us_pop	long_rate	long_rate_src	date3	hpi	hpi_src	date4	nom_build_cost_idx	nom_build_cost_src	date5	cpi	cpi_src	date6	cpi_annual
0	1890	100.000000	1890	51.362340	63.056	3.42	Homer	1890	3.657175	Grebler	1890	NaN	Grebler	1890	7.611652	Warren&Pearson	1890	7.611652
1	1891	88.011791	1891	47.266163	64.361	3.62	Homer	1891	3.299213	Grebler	1891	NaN	Grebler	1891	7.801942	Warren&Pearson	1891	7.801942
2	1892	95.421736	1892	52.048155	65.666	3.60	Homer	1892	3.358873	Grebler	1892	NaN	Grebler	1892	7.326213	Warren&Pearson	1892	7.326213
3	1893	92.297385	1893	44.673332	66.970	3.75	Homer	1893	3.502058	Grebler	1893	NaN	Grebler	1893	7.897091	Warren&Pearson	1893	7.897091
4	1894	123.980483	1894	57.263434	68.275	3.70	Homer	1894	4.080763	Grebler	1894	NaN	Grebler	1894	6.850483	Warren&Pearson	1894	6.850483
5	1895	117.455092	1895	61.470338	69.580	3.46	Homer	1895	3.704903	Grebler	1895	NaN	Grebler	1895	6.565052	Warren&Pearson	1895	6.565052
6	1896	100.302990	1896	60.068945	70.885	3.60	Homer	1896	3.209723	Grebler	1896	NaN	Grebler	1896	6.660193	Warren&Pearson	1896	6.660193
7	1897	106.515703	1897	62.384893	72.189	3.40	Homer	1897	3.311145	Grebler	1897	NaN	Grebler	1897	6.469903	Warren&Pearson	1897	6.469903
8	1898	110.184140	1898	61.438038	73.494	3.35	Homer	1898	3.525922	Grebler	1898	NaN	Grebler	1898	6.660193	Warren&Pearson	1898	6.660193
9	1899	103.853113	1899	64.044606	74.799	3.10	Homer	1899	3.370805	Grebler	1899	NaN	Grebler	1899	6.755342	Warren&Pearson	1899	6.755342
10	1900	101.574295	1900	49.420634	76.094	3.15	Homer	1900	3.854054	Grebler	1900	NaN	Grebler	1900	7.897091	Warren&Pearson	1900	7.897091